NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part1 / 5954 < prev next >

Wrap

Internet Message Format | 1996-08-05 | 2.8 KB

Path: pegasus.montclair.edu!harmon From: harmon@pegasus.montclair.edu (Derek Harmon) Newsgroups: comp.lang.c Subject: Re: unique id for a string Date: 18 Feb 1996 17:20:46 -0500 Organization: Montclair State University Message-ID: <harmon.824680556@pegasus.montclair.edu> References: <1996Feb16.175601.114182@kuhub.cc.ukans.edu> NNTP-Posting-Host: pegasus.montclair.edu X-Newsreader: NN version 6.5.0 #68 (NOV) anh@kuhub.cc.ukans.edu writes: >This is not a quite a C problem. But since the implementation will be in >C anyway. Then I won't quite give an answer in C. :) >Given an alphabet, for example, E = {A-Z}, I need to assign an unique id >to each possible strings (finite, approx 32 chars long, or whatever, but >it is finite) formed by the letters from E. I would like the id to be as >small as possible. Is there a good formula? >HELLO = H * 26^4 + E * 26 ^3 + L * 26^2 + L * 26^1 + E * 26^0 If the possible strings had to be English, a slight improvement (smaller id for more strings) could be attained by ordering the ASCII char values to alternate values for the corresponding character which are used less in the language (ie, vowels would be ASCII 1-5, S could be 6, and X could be 26). >But the id is way too big. I need something that fits within a 32-bits int >or a 64-bits long. >Another way is to simply assign a id to a word and store the word in a >binary search tree, and therefore, it is relatively fast to make sure >each word has an unique id. But I think it is still to slow. We are >talking about a possible 200,000+ words. How balanced the tree would be depends on either the order of input, or the tree's self-balancing of itself which makes input longer, but logarith- mically lowers the bound on search time. But we are talking about more than 200,000 words here, and that is where the problem occurs. If the max length of these strings is 32 characters, then we are talking about roughly 7.3 x 10^43 different strings. That will not fit into a 64-bit long. :) The formula you give above will give all those strings distinct id nums, if you have knowledge of the probability of occurence of symbols from that language, you can shift the probability of smaller id num size (there will still be large id nums) for a random sampling of those strings. However, if symbols have equal probabilities of selection, finnagling of the formula won't add any advantage. If it were possible, the world would be a path to your door for an infinite data compression method. -- Stone -- # Derek Harmon (aka Stonelight) harmon@pegasus.montclair.edu # - Computer Science Undergrad, Montclair State University, NJ # - My views are my own, nobody else is this creative. 3;)> ... Recursive. adj., see Recursive.